Experiments in Data Compression IV (or "The Need for Speed")

by John Kormylo
E.L.F. Software


Binary Search Tree

The LZA algorithm uses a binary search tree to locate strings. Specifically,
it consists of an array of structures of the form

typedef struct tree {
    struct tree *parent;    /* binary tree pointers, or NULL */
    struct tree *child1;    /* less than */
    struct tree *child2;    /* greater than */
    unsigned char *text;    /* string pointer */
    unsigned int count;     /* number of children + 1 */
    int first;              /* number of bytes common with parent */
} TREE;

with one structure for every string in the dictionary. (Note, these are not
NULL terminated strings, but it is easier to say "string" than "block of 65
consecutive bytes.") One would do a string compare against 'text' and move to
'child1' or 'child2' and repeat. The 'parent' entry was used for tree
balancing, but turns out not to be necessary. The 'count' entry was used both
for tree balancing and for coding, and the 'first' entry was an attempt to
reduce the number of byte comparisons during the string compares. (However,
nothing guarantees that child strings will have ANY bytes in common.)

To locate the longest match one must search until a NULL pointer is hit.
Given the first (in search order) match for a given length, one must search
all child strings to locate the most recent matching string (for the FIFO
buffer) or search for the lowest and highest matching strings (for the
dictionary).


FIFO Buffer - Hash Table

Actually, the method developed in the last report was more of a cross between
a hash table and a binary search tree. Specifically, it consisted of 16 byte
structures of the form:

typedef struct hash {
    struct hash *child1;    /* less than */
    struct hash *child2;    /* greater than */
    struct hash *next;      /* start of next tree */
    unsigned int sum;       /* sum of all matching strings */
    unsigned char index;    /* index of most recent match */
    unsigned char test;     /* character */
} HASH;

Given a pointer to the start of the table, one would search for the first
byte in the string using a binary search tree via 'child1', 'child2', and
'test'. Once you have found the first byte, you can search for the second
byte using the binary search tree which starts at 'next'. 'Sum' is used by
the tree balancing algorithm, and 'index' is the array index for the
corresponding string in the FIFO buffer.

The number of structures used is variable, with the worst case being
64 * 65 * sizeof(HASH) = 65K bytes, since there are 64 strings in the FIFO
buffer and each string has at most 65 bytes. The structures are initialized
as a simple linked list of free structures. Also, you will need an array of
64 string pointers and weights. (The hash table is used for coding only.)

The advantage of this approach is that we have replaced string compares with
byte compares. Once the first byte is found, one need only search for the
second byte, etc. Also, the 'index' value is automatically reset each time a
string is added to the hash table, so one need not search for the most recent
match.
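To make the byte-by-byte search concrete, here is a minimal sketch of how the
longest match for a string could be located in this structure. It assumes
'start' points to the tree for the first byte; the function name and
arguments are illustrative, not taken from the actual ELFBACK source.

/* Illustrative sketch only: walk one binary tree per byte position,
   following 'next' to reach the tree for the following byte. */
static int longest_match(HASH *start, unsigned char *str, int max_len,
                         int *index)
{
    HASH *p = start;
    int len = 0;

    while (p && len < max_len) {
        unsigned char byte = str[len];
        while (p && p->test != byte)    /* binary search within this layer */
            p = (byte < p->test) ? p->child1 : p->child2;
        if (!p)
            break;                      /* no string continues with this byte */
        len++;                          /* one more byte matched */
        *index = p->index;              /* most recent string with this prefix */
        p = p->next;                    /* tree for the following byte */
    }
    return len;                         /* number of bytes matched */
}

On a match of non-zero length, the returned 'index' already identifies the
most recent FIFO entry, precisely because 'index' is overwritten whenever a
string is added to the hash table.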
Main Dictionary - Layered Binary Search Tree

This approach combines the speed of a hash table with the low memory
requirement of a binary search tree. As before, you need an array of
structures, one for each string in the dictionary. In addition you will need
a variable number of structures which point to the same strings, but are
located in lower layers. The worst case is also the same as the number of
strings in the dictionary, for a total of twice the RAM needed for a simple
binary search tree.

The structures used here are given by:

typedef struct search {
    struct search *child1;  /* less than */
    struct search *child2;  /* greater than */
    struct search *next;    /* start of next tree */
    unsigned char *text;    /* string pointer */
    unsigned int sum;       /* sum of all children */
    unsigned char len;      /* number of common chars */
    unsigned char test;     /* first uncommon char */
} NODE;

The 'len' parameter is constant over a layer. The 'test' entry is redundant,
since p->test = p->text[p->len], but it is faster to access and is free
(since structures always take an even number of bytes).

Given a pointer to the start of the tree, one would compare the first byte
against 'test' and perform the usual binary search via 'child1' and 'child2'.
If a match is found, 'next' points to the start of another binary search
tree, but not necessarily for the second byte. One would first have to check
all the bytes in p->text[1] through p->text[p->len - 1] (where p points to
the first entry in the next layer). If they all match, one can then search
for a matching 'test' using the binary tree for this layer, etc.

The advantage is again that you need only search for each byte once. Also,
when you find a match of a given length, the number of matching entries is
given either by 'sum' (if you are between layers), by 'next->sum' (if it
exists), or is 1.

The trick is to copy the parent structure (at least, it uses the same 'text'
pointer) whenever you start a new layer, then add the new string as a child
structure of the copy. Similarly, whenever you remove a string from the
dictionary, you have to search the next layer to find the duplicate, replace
the parent with a structure from the next layer, and replace the duplicate
with a copy of the new parent (unless there are no more entries on that
layer). It should be remembered that the duplicate to be removed might also
have a duplicate in the next layer.


Tree Balancing

Consider the two binary search trees organised as follows:

    A                C
     \              /
      C            A
     /              \
    B                B

where A < B < C by value. It should be noted that any additional children of
A, B or C would remain in place if one were to convert from one to the other.
This transformation is called a rotation.

Starting with A using the left tree, suppose we wanted to add a new node
D > C. Starting at A we see that we need to move to the right. However,
before doing so we check to see if the number of children to the right
(child2->sum) is greater than the number of children to the left
(child1->sum). If so, we perform the rotation. Since we are now at C instead
of A, we need to repeat the comparison. Also, before moving down we should
increment the 'sum' entry at C.

One can avoid the need for a parent pointer by using the approach:

HASH *p, **from;

from = &start;
p = *from;
while (p) {
    ...
    if (byte < p->test)
        from = &(p->child1);
    else if (byte > p->test)
        from = &(p->child2);
    else
        from = &(p->next);
    p = *from;
}

Changing '*from' will change the appropriate pointer, no matter where it came
from.
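As an illustration, here is one way the rotation itself might be written
using the same '**from' trick, so that whatever pointed at A afterwards
points at C. It assumes 'sum' holds a node's own weight plus the totals of
both subtrees; the function name and that bookkeeping are assumptions, not
the actual ELFBACK code.

/* Illustrative sketch only: rotate the right child C up above A, as in the
   diagram above. */
static void rotate_up(HASH **from)
{
    HASH *a = *from;                /* node A, the current subtree root */
    HASH *c = a->child2;            /* node C, about to move up */
    unsigned int a_total = a->sum;  /* total for the whole subtree */
    unsigned int b_total = c->child1 ? c->child1->sum : 0;  /* subtree B */

    a->child2 = c->child1;          /* B (or nothing) moves under A */
    c->child1 = a;                  /* A becomes the left child of C */
    *from = c;                      /* whatever pointed at A now points at C */

    a->sum = a_total - c->sum + b_total;  /* A keeps its left subtree and B */
    c->sum = a_total;               /* the overall total is unchanged */
}

In the insertion loop above, such a rotation would be performed just before
descending to 'child2', after which the comparison is repeated from C as
described.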
Fine Tuning

The following table shows the results for various methods of choosing which
matches to use. The first number is the time required to perform the
compression (accurate to 2 seconds) and the second is the size of the
compressed file. The 'ZIP' entry is used for comparison (times are not
available).

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
ZIP    |        25134 |       32417 |       496876 |
Smart  |   18   22284 |  22   32567 |  418  503524 |
Best   |  102   22029 |  64   31997 | 1670  453142 |
Best1  |   76   21746 |  54   32451 | 1654  446201 |
Best2  |   70   21960 |  50   32372 | 1454  446607 |
-------+--------------+-------------+--------------+
8K     |   20   22444 |  22   32405 |  424  496712 |
384    |   60   21979 |  40   32417 | 1190  419201 |
8K384  |   60   22133 |  40   32288 | 1164  421197 |
-------+--------------+-------------+--------------+

The 'Smart' method is essentially the same as that developed in the previous
report. It looks for the longest combination of two strings. To decide
between a string from the FIFO buffer and one from the main dictionary, each
match is ranked by the weight for that offset times the total number of
entries in the main dictionary, or by the sum of matches in the main
dictionary times the sum of all the weights for the FIFO buffer. To decide
between two combinations of strings which have the same combined length, it
looks at the rank times the length of the first match. This is the fastest
method which is still competitive with ZIP.

The 'Best' method looks at all possible ways to compress the next 65 bytes
until a unique first step is determined. Analysis by the Pure Profiler showed
that most of the time is now spent computing the number of bits needed to
code strings, rather than searching for them.

'Best1' is a variation of the 'Best' algorithm which only codes a single byte
using arithmetic coding if there is no other way to get to the next byte.
'Best2' also checks whether the number of bits used to get to the end of the
longest combination found so far is less than the number of bits needed to
get to the current byte, in which case it doesn't bother to test whether any
strings starting from the current byte will help. While these variations
achieve only marginal improvements in speed, there is very little loss of
compression, and sometimes they even out-perform the 'Best' algorithm.

The best thing about all these methods is that they are totally compatible
with the current versions of ELFARC and ELFBACK. However, if we relax this
requirement, additional savings are possible.

'8K' shows the 'Smart' algorithm results when the main dictionary is reduced
from 16K to 8K entries (640K to 320K bytes). Note that this method beats ZIP
on all three files.

'384' is a variation of 'Best2' which uses 384 codes instead of 320.
Specifically, these consist of the 256 basic codes, 64 length codes for the
FIFO buffer and 64 length codes for the main dictionary. This was intended to
reduce the number of steps needed to code a string (or to compute the number
of bits needed to code a string). Whether it benefits or hurts the
compression depends on whether the distributions of lengths are similar for
the two dictionaries. '8K384' uses both an 8K main dictionary and the 384
codes. Interestingly, using 384 codes had no benefit for the 'Smart'
algorithm whatsoever.

I also discovered that a very minor improvement in speed is possible by
sorting the codes in the arithmetic compression so that the most common codes
are close together.


Conclusion

At this point the difference in compression efficiency between the Smart and
Best approaches is minor compared to the difference in speed. Since the '8K'
method satisfies the design criteria, it will be used for the next versions
of both ELFBACK and ELFARC.